##Readme: This folder contains replication files for “Measuring the Spillovers of Venture Capital” published in The Review of Economics and Statistics

Author: Martin Watzinger, LMU Munich, martin.watzinger@gmail.com

This archive contains all do and data files to replicate the figures and tables in the paper.

#Version and software
The code was tested on Stata MP16 and R 4.0 with RTools and RStudio.
Data preparation (0_data_prep.do) takes 92 hours on an Intel Xeon E5-1620 v4 @ 3.50GHz with 128 GB RAM. At least 100 GB of RAM is required.

#Necessary packages that have to be installed by the user:
Stata:
* ivreg2, ranktest, plotmatrix, estout (which provides esttab)
R:
* Matrix, foreign, MASS, stats
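
The Stata packages listed above can typically be installed from SSC. A minimal sketch, assuming an internet connection and that the current SSC versions behave like the ones the code was tested with (this README does not state package versions). The esttab command ships with estout, so it needs no separate install:

```stata
* Install the user-written Stata packages from SSC (run once)
foreach pkg in ivreg2 ranktest plotmatrix estout {
    ssc install `pkg', replace
}
```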


#Instructions
The do file "0_data_prep.do" creates the datasets used in the analysis from the raw data.
The do file "1_results.do" creates all tables and figures from the paper. In some tables (Table 2b and Table 3) the numbers differ slightly from those in the published paper because I added a random number to some variables to ensure anonymity.
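
The two steps can be run in sequence from Stata. A minimal sketch, assuming the working directory is the root of this archive (the path below is a hypothetical placeholder) and that the do files need no further path settings:

```stata
* Hypothetical location of the unpacked archive; adjust to your system
cd "path/to/replication_folder"

* Step 1: build the analysis datasets from the raw data (long-running)
do "0_data_prep.do"

* Step 2: produce all tables and figures
do "1_results.do"
```

Because "proc" already holds the final datasets, it should be possible to run step 2 on its own.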

The directory "data" contains all raw data files. The directory "proc" contains all final datasets along with all intermediate datasets created during data preparation. Note that the data preparation file anonymizes the dataset by adding a random number to some variables.

#Datasources
Data has been obtained from the following sources:
VentureXpert - data on all US VC-backed start-ups
Compustat North America - data on all established companies
NBER Patent Database (https://sites.google.com/site/patentdataproject/Home/downloads)
PatentsView Database (https://www.patentsview.org/download/)
Schnitzer, Monika, and Martin Watzinger. Standing on the shoulders of science. No. 13766. CEPR Discussion Papers, 2019. (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/L2BB9F)
Morrison, Greg, Massimo Riccaboni, and Fabio Pammolli. "Disambiguation of patent inventors and assignees using high-resolution geolocation data." Scientific Data 4 (2017): 170064. (https://www.nature.com/articles/sdata201764)

To run 0_data_prep.do, the VentureXpert data must be purchased from Thomson Financial and the Compustat North America data from Standard & Poor's. Data files that are not included are marked with a star (*).
Datasets that must be downloaded from publicly available file servers are marked with a cross (x). If you need the publicly available data, I am happy to send it via data transfer. The total file size is 12 GB.
All data should be put into the folder 'data'.

#Input datafiles
compustatNorthAmerica.csv (*) - Compustat database
VC_USA (1-10) Name.xls (*) - Data on VC financed start-ups from VentureXpert 
FUNDS.xls (*) - Fundraising data from VentureXpert 
LinkedInventorNameLocData (x) - Geolocated inventor data of Morrison et al. (2017)
claim.tsv (x) - Data on patent claims from PatentsView
assignee.dta and pat76_06_assg (x) - Assignee and patent data from the NBER patent database
main_file_novelty - Data from Schnitzer and Watzinger (2019)
rdCosts.dta - R&D subsidies per company
venturePatentName.dta - Manual match of VentureXpert data to patent data
area_codes.csv and cities_states - Area codes and cities per state - hand coded from https://en.wikipedia.org/wiki/List_of_North_American_Numbering_Plan_area_codes
patentCategory - NBER category and subcategory per USPTO technology class (nclass)
deflator.dta - Dollar deflator
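
Because data preparation runs for a long time, it can help to check first that the input files are in place. A minimal sketch, assuming the working directory is the archive root. It only checks the files whose full names appear in the list above; the remaining inputs have to be checked by hand:

```stata
* Count the listed input files that are not found in the "data" folder
local missing 0
foreach f in compustatNorthAmerica.csv FUNDS.xls claim.tsv assignee.dta ///
             rdCosts.dta area_codes.csv deflator.dta {
    capture confirm file "data/`f'"
    if _rc {
        display as error "missing: data/`f'"
        local ++missing
    }
}
display "`missing' of the checked input files are missing"
```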

#Final datafiles
All variables are labeled, and the do files indicate which figure or table each part produces.
work.dta - Main analysis data file
nclassNclassLikelihood.dta - Cross-citation probability between USPTO technology classes
heatfile1.dta - Cross-citation probability between NBER subcategories
ventureDescriptive_aggregated - Number of start-ups and share with a patent
patentRaw_aggregated.dta - Number of patents and citations per patent, by company type
work_aggregated - Average scaled citations per patent over time, by company type
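
To inspect the labeled variables in the main analysis file, a minimal sketch, assuming the working directory is the archive root and that work.dta sits directly in "proc":

```stata
* Load the main analysis file and list its variables with their labels
use "proc/work.dta", clear
describe
```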




